Suggestions for Implementing a Fast IEEE Multiply-Add-Fused Instruction

Authors

  • Nhon T. Quach
  • Michael J. Flynn
Abstract

We studied three possible strategies for overlapping the operations of a floating-point add (FADD) and a floating-point multiply (FMPY) to implement a multiply-add-fused (MAF) instruction whose result is compatible with the IEEE floating-point standard. The operations in FMPY and FADD are: (a) non-overlapped, (b) fully overlapped, and (c) partially overlapped. The first strategy corresponds to the multiply-add-chained (MAC) approach widely used in vector processors. The second (Greedy) strategy uses a greedy algorithm, yielding an implementation similar to that of the IBM RS/6000. The third (SNAP) strategy uses a less aggressive starting configuration and corresponds to the Stanford Nanosecond Arithmetic Processor (SNAP) implementation. Two observations prompted this study. First, in the IBM RS/6000 implementation, the design tradeoffs were made in favor of high internal data precision, which facilitates the execution of elementary functions. These tradeoff decisions, however, may not be valid for an IEEE-compatible MAF. Second, the RS/6000 implementation assumed different critical paths for FADD and FMPY, which does not reflect the current state of the art in floating-point technology. Using latency and hardware cost as the performance metrics, we show that: (1) MAC has the lowest FADD latency and consumes the least hardware, but its MAF latency is the highest; (2) Greedy has an intermediate MAF latency but the highest FADD latency; and (3) SNAP provides the lowest MAF latency at the expense of a small increase in FADD latency over MAC and in area over Greedy. Both Greedy and SNAP have higher design complexity arising from rounding for the IEEE standard. SNAP has additional wire complexity, which Greedy avoids because of its simpler datapath. If rounding for the IEEE standard is not a requirement, the Greedy strategy, and therefore the RS/6000, seems a reasonable middle ground for applications with a high MAF-to-FADD ratio.


Related papers

Optimization Effects on Modeling and Synthesis of a Conventional Floating-Point Fused Multiply-Add Arithmetic Unit Using CAD

In this paper, a high-speed, synthesizable fused multiply-add (FMA) arithmetic unit is modeled, capable of implementing the following operations: addition/subtraction and multiplication. Within the area-speed tradeoff limitation, the concentration is on modeling high-speed arithmetic units with a moderate area increase; thus, the concentration is on developing units that share the same hardware. A model of a...


Impact on Performance of Fused Multiply-Add Units in Aggressive VLIW Architectures

Loops are the main time-consuming part of programs based on floating-point computations. The performance of the loops is limited either by recurrences in the computation or by the resources offered by the architecture. Several general-purpose superscalar microprocessors have been implemented with fused multiply-add floating-point units, which reduce the latency of the combined operation and the...


Multiply-Add Optimized FFT Kernels

Modern computer architecture provides a special instruction, the fused multiply-add (FMA) instruction, to perform both a multiplication and an addition operation at the same time. In this paper, newly developed radix-2, radix-3, and radix-5 FFT kernels that efficiently take advantage of this powerful instruction are presented. If a processor is provided with FMA instructions, the radix-2 FFT algorit...


Floating-Point Single-Precision Fused Multiplier-Adder Unit on FPGA

The fused multiply-add operation improves many calculations and therefore is already available in some general-purpose processors, like the Itanium. The optimization of units dedicated to executing the multiply-add operation is therefore crucial to achieving optimal performance when running the overlying applications. In this paper, we present a single-precision floating-point fused multiply-add opt...


Cost-Conscious Strategies to Increase Performance of Numerical Programs on Aggressive VLIW Architectures

Loops are the main time-consuming part of numerical applications. The performance of the loops is limited either by the resources offered by the architecture or by recurrences in the computation. To execute more operations per cycle, current processors are designed with growing degrees of resource replication (replication technique) for memory ports and functional units. However, the high cost...



Journal:

Volume   Issue 

Pages  -

Publication date: 1998